Fast rates for empirical vector quantization
We consider the rate of convergence of the expected loss of empirically optimal vector quantizers. Earlier results show that, for any fixed distribution supported on a bounded set and satisfying some regularity conditions, the mean-squared expected distortion decreases at the rate O(log n/n). We prove that this rate is actually O(1/n). Although these conditions are hard to check, we show that well-polarized distributions with continuous densities supported on a bounded set fall within the scope of this result. (18 pages.)
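To make the quantity being bounded concrete: the distortion of a k-point quantizer (codebook) on a sample is the mean squared distance from each sample point to its nearest code point. A minimal numpy sketch, with function names of our own choosing rather than anything from the paper:

```python
import numpy as np

def distortion(points, codebook):
    """Mean squared distance from each sample point to its nearest code point."""
    # pairwise squared distances, shape (n, k)
    d2 = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean()

rng = np.random.default_rng(0)
sample = rng.uniform(-1.0, 1.0, size=(1000, 2))  # distribution with bounded support
codebook = np.array([[-0.5, -0.5], [0.5, 0.5]])  # a 2-point quantizer
d = distortion(sample, codebook)
```

The empirically optimal quantizer is the codebook minimizing this quantity over all k-point codebooks; the abstract's O(1/n) rate concerns how its expected distortion approaches the optimum as the sample grows.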
Quantization/clustering: when and why does k-means work?
Though mostly used as a clustering algorithm, k-means was originally designed as a quantization algorithm: it aims at providing a compression of a probability distribution with k points. Building upon [21, 33], we investigate how and when these two approaches are compatible. Namely, we show that, provided the sample distribution satisfies a margin-like condition (in the sense of [27] for supervised learning), both the associated empirical risk minimizer and the output of Lloyd's algorithm provide almost optimal classification in certain cases (in the sense of [6]). Besides, we also show that they achieve fast and optimal convergence rates in terms of sample size and compression risk.
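The two readings of k-means can be illustrated with a bare-bones Lloyd iteration in numpy. This is a sketch under our own naming, with a deterministic farthest-point initialization (not an initialization the paper analyzes): on a well-separated mixture, the quantization centers also recover the clusters, which is the compatibility the abstract is about.

```python
import numpy as np

def lloyd(points, k, n_iter=50):
    """Plain Lloyd iterations: alternate nearest-center assignment and mean update."""
    # deterministic greedy farthest-point initialization
    centers = [points[0]]
    for _ in range(k - 1):
        d2 = ((points[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(points[int(d2.argmax())])
    centers = np.array(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # update step: cell barycenters
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# a well-separated two-component mixture: quantization and clustering coincide
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (100, 2)), rng.normal(5.0, 0.1, (100, 2))])
centers, labels = lloyd(pts, k=2)
```

The margin-like condition in the abstract is, informally, what rules out the hard cases where the optimal quantization cells cut through high-density regions and this happy coincidence fails.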
Une fonction distance à k points pour l'inférence géométrique robuste (A k-point distance function for robust geometric inference)
Analyzing the sub-level sets of the distance to a compact sub-manifold of R^d is a common method in topological data analysis to understand its topology. Topological inference procedures therefore usually rely on a distance estimate based on n sample points [41]. In the case where sample points are corrupted by noise, the distance-to-measure function (DTM, [16]) is a surrogate for the distance-to-compact-set function. In practice, computing the homology of its sub-level sets requires computing the homology of unions of n balls ([28, 14]), which might become intractable whenever n is large. To simultaneously face the two problems of a large number of points and noise, we introduce the k-power-distance-to-measure function (k-PDTM). This new approximation of the distance-to-measure may be thought of as a k-point-based approximation of the DTM. Its sublevel sets consist of unions of k balls, and this distance is also proved robust to noise. We assess the quality of this approximation for k possibly drastically smaller than n, and provide an algorithm to compute the k-PDTM from a sample. Numerical experiments illustrate the good behavior of this k-point approximation in a noisy topological inference framework.
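For intuition, the DTM that the k-PDTM approximates can be sketched in a few lines of numpy: the empirical DTM at a query point is the root mean squared distance to its h nearest sample points (mass parameter m = h/n), which is exactly what makes it robust to outliers where the plain distance-to-set is not. Function names below are ours:

```python
import numpy as np

def dtm(x, sample, h):
    """Empirical distance-to-measure at x: root mean squared distance
    to the h nearest sample points (mass parameter m = h / n)."""
    d2 = ((sample - x) ** 2).sum(axis=1)
    return np.sqrt(np.sort(d2)[:h].mean())

def dist_to_set(x, sample):
    """Plain distance to the sample, not robust to outliers."""
    return np.sqrt(((sample - x) ** 2).sum(axis=1).min())

rng = np.random.default_rng(0)
circle = rng.normal(size=(200, 2))
circle /= np.linalg.norm(circle, axis=1, keepdims=True)  # points on the unit circle
noisy = np.vstack([circle, [[10.0, 0.0]]])               # one far outlier
x = np.array([9.5, 0.0])
# dist_to_set is fooled by the single outlier near x; the DTM is not
```

The k-PDTM replaces the n-ball computation implicit in the DTM's sublevel sets by a union of only k balls.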
Non Asymptotic Bounds for Vector Quantization in Hilbert Spaces
(30 pages; technical proofs are omitted and can be found in the related unpublished paper "Margin conditions for vector quantization".) Recent results in quantization theory show that the convergence rate of the mean-squared expected distortion of the empirical risk minimizer strategy is O(1/n) for any fixed probability distribution satisfying some regularity conditions, where n is the sample size. However, the dependency of the average distortion on other parameters is not known. This paper offers more general conditions, which may be thought of as margin conditions, under which a sharp upper bound on the expected distortion rate of the empirically optimal quantizer is derived. This upper bound is also proved to be sharp with respect to the dependency of the distortion on other natural parameters of the quantization issue.
Optimal quantization of the mean measure and applications to statistical learning
This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold: first, we intend to approximate with a compactly supported measure the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and we investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a finite-dimensional representation that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.
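As a rough illustration of the pipeline (quantize the mean measure, then vectorize each measure against the resulting codebook), here is a numpy sketch in which plain Lloyd iterations stand in for the paper's minimax-optimal algorithms; all function names are ours:

```python
import numpy as np

def quantize_mean_measure(measures, k, n_iter=50):
    """Quantize the empirical mean measure: pool the point sets and run
    plain Lloyd iterations (a crude stand-in for the paper's algorithms)."""
    pooled = np.vstack(measures)
    centers = pooled[np.linspace(0, len(pooled) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        labels = ((pooled[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pooled[labels == j].mean(axis=0)
    return centers

def vectorize(measure, centers):
    """Send a measure into R^k: fraction of its mass falling in each center's cell."""
    labels = ((measure[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    return np.bincount(labels, minlength=len(centers)) / len(measure)

# a two-component mixture of measure-generating processes
rng = np.random.default_rng(2)
type_a = [rng.normal(0.0, 0.2, (30, 2)) for _ in range(5)]
type_b = [rng.normal(5.0, 0.2, (30, 2)) for _ in range(5)]
centers = quantize_mean_measure(type_a + type_b, k=4)
va, vb = vectorize(type_a[0], centers), vectorize(type_b[0], centers)
```

On such a separated mixture, measures from different components land far apart in R^k while measures from the same component land close together, which is the clustering guarantee the abstract refers to.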
Optimal quantization of the mean measure and application to clustering of measures
This paper addresses the case where data come as point sets, or more generally as discrete measures. Our motivation is twofold: first, we intend to approximate with a compactly supported measure the mean of the measure-generating process, which coincides with the intensity measure in the point process framework, or with the expected persistence diagram in the framework of persistence-based topological data analysis. To this aim we provide two algorithms that we prove almost minimax optimal. Second, we build from the estimator of the mean measure a vectorization map that sends every measure into a finite-dimensional Euclidean space, and we investigate its properties through a clustering-oriented lens. In a nutshell, we show that for a mixture of measure-generating processes, our technique yields a finite-dimensional representation that guarantees a good clustering of the data points with high probability. Interestingly, our results apply in the framework of persistence-based shape classification via the ATOL procedure described in \cite{Royer19}.
La k-PDTM : un coreset pour l'inférence géométrique (The k-PDTM: a coreset for geometric inference)
Analyzing the sub-level sets of the distance to a compact sub-manifold of R^d is a common method in TDA to understand its topology. The distance to measure (DTM) was introduced by Chazal, Cohen-Steiner and Mérigot in [7] to face the non-robustness of the distance to a compact set to noise and outliers. This function makes possible the inference of the topology of a compact subset of R^d from a noisy cloud of n points lying nearby in the Wasserstein sense. In practice, these sub-level sets may be computed using approximations of the DTM such as the q-witnessed distance [10] or other power distances [6]. These approaches eventually lead to computing the homology of unions of n growing balls, which might become intractable whenever n is large. To simultaneously face the two problems of a large number of points and noise, we introduce the k-power distance to measure (k-PDTM). This new approximation of the distance to measure may be thought of as a k-coreset-based approximation of the DTM. Its sublevel sets consist of unions of k balls, k << n, and this distance is also proved robust to noise. We assess the quality of this approximation for k possibly dramatically smaller than n; for instance, k = n^{1/3} is proved to be optimal for 2-dimensional shapes. We also provide an algorithm to compute this k-PDTM.
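The k-PDTM can be sketched as a power distance: attach to each of k centers the local mean m_i and local variance v_i of the h sample points nearest to it, and evaluate d(x) = min_i (||x - m_i||^2 + v_i)^{1/2}, whose sublevel sets are unions of k balls. The numpy sketch below uses our own naming and crude fixed centers rather than the optimized centers the paper's algorithm produces:

```python
import numpy as np

def local_mean_and_var(c, sample, h):
    """Mean and variance of the h sample points nearest to center c."""
    d2 = ((sample - c) ** 2).sum(axis=1)
    idx = np.argsort(d2)[:h]
    m = sample[idx].mean(axis=0)
    v = ((sample[idx] - m) ** 2).sum(axis=1).mean()
    return m, v

def k_pdtm(x, centers, sample, h):
    """Power distance min_i (||x - m_i||^2 + v_i)^{1/2} over the k centers'
    local means/variances -- a sketch of the k-PDTM, not the exact algorithm."""
    vals = []
    for c in centers:
        m, v = local_mean_and_var(c, sample, h)
        vals.append(((x - m) ** 2).sum() + v)
    return np.sqrt(min(vals))

rng = np.random.default_rng(0)
circle = rng.normal(size=(200, 2))
circle /= np.linalg.norm(circle, axis=1, keepdims=True)  # n = 200 points on the unit circle
centers = circle[::40]  # k = 5 crude centers, k << n
```

Only k balls (one per center) are needed to describe its sublevel sets, versus n for the DTM, which is the coreset point of the abstract.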
ATOL: Measure Vectorisation for Automatic Topologically-Oriented Learning
Robust topological information commonly comes in the form of a set of persistence diagrams, finite measures that are by nature uneasy to affix to generic machine learning frameworks. We introduce a learnt, unsupervised measure vectorisation method and use it for reflecting underlying changes in topological behaviour in machine learning contexts. Relying on optimal measure quantisation results, the method is tailored to efficiently discriminate important plane regions where meaningful differences arise. We showcase the strength and robustness of our approach on a number of applications, from competitive and modern graph collections, where the method reaches state-of-the-art performance, to a synthetic geometric problem of dynamical orbits. The proposed methodology comes with only high-level tuning parameters, such as the total measure encoding budget, and we provide completely open-access software.
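A simplified, ATOL-flavoured vectorisation fits in a few lines. This is our own toy version, not the released implementation: centers would normally come from the quantisation step (here they are hand-picked), and persistence diagrams are treated as plain 2-d point sets of (birth, death) pairs.

```python
import numpy as np

def soft_features(measure, centers, scale):
    """One feature per center: Gaussian-kernel mass of the measure around it
    (a toy stand-in for ATOL's learnt contrast functions)."""
    d2 = ((measure[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * scale ** 2)).sum(axis=0)

# toy persistence diagrams: class-B diagrams have longer-lived features
rng = np.random.default_rng(3)
diagrams_a = [np.column_stack([rng.uniform(0, 1, 20), rng.uniform(1, 2, 20)]) for _ in range(3)]
diagrams_b = [np.column_stack([rng.uniform(0, 1, 20), rng.uniform(3, 4, 20)]) for _ in range(3)]
centers = np.array([[0.5, 1.5], [0.5, 3.5]])  # hand-picked for this toy example
fa = soft_features(diagrams_a[0], centers, scale=0.5)
fb = soft_features(diagrams_b[0], centers, scale=0.5)
```

The resulting fixed-length vectors can be fed to any off-the-shelf classifier, which is how the method plugs diagrams into generic machine learning frameworks.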